Maximum Likelihood With a Time Varying Parameter
Lanconelli, Alberto, Lauria, Christopher S. A.
When estimating unknown parameters in a dynamic model, the optimal solution to the parameter estimation problem may not remain constant. Specifically, the optimal values of the model parameters may change over time as the underlying process evolves, and finding them is, in general, not straightforward. A survey of basic techniques for tracking the time-varying dynamics of a system is provided in [Ljung and Gunnarsson, 1990], where recursive algorithms in non-stationary stochastic optimization are analysed under different assumptions about the true system's variations; see also [Simonetto et al., 2020] for a review in a purely deterministic setting. In [Delyon and Juditsky, 1995] the problem of tracking the randomly drifting parameters of a linear regression system is tackled, and [Zhu and Spall, 2016] derives a computable bound on the tracking error of a constant-gain stochastic approximation scheme following a non-stationary target. Subsequently, [Wilson et al., 2019] introduces a framework for sequentially solving convex stochastic minimization problems in which the distance between successive minimizers is bounded. The minimization problems are then solved by sequentially applying an optimization algorithm, such as stochastic gradient descent (SGD). In a similar setting, [Cao et al., 2019] establishes an upper bound on the regret of a projected SGD algorithm with respect to the drift of the dynamic optima, while [Cutler et al., 2021] provides novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging.
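The constant-gain tracking idea surveyed above can be sketched in a few lines; the quadratic loss, random-walk drift model, and step size here are illustrative assumptions, not taken from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Drifting target: the optimal parameter theta*_t follows a slow random walk.
T, drift, gain = 2000, 0.01, 0.1
theta_star = np.cumsum(drift * rng.standard_normal(T))

# Constant-gain SGD on the noisy quadratic loss 0.5*(theta - theta*_t)^2,
# with stochastic gradient g_t = (theta - theta*_t) + noise.
theta, errors = 0.0, []
for t in range(T):
    g = (theta - theta_star[t]) + rng.standard_normal()
    theta -= gain * g
    errors.append((theta - theta_star[t]) ** 2)

# A constant (non-decaying) step size keeps the tracking error bounded,
# instead of letting the iterate freeze while the target drifts away.
print(f"mean squared tracking error, last half: {np.mean(errors[T//2:]):.3f}")
```

With a decaying step size the iterate would eventually stop moving and the error would grow with the drift; the constant gain trades asymptotic accuracy for the ability to follow the moving optimum.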
On the Importance of Sampling in Learning Graph Convolutional Networks
Cong, Weilin, Ramezani, Morteza, Mahdavi, Mehrdad
Graph Convolutional Networks (GCNs) have achieved impressive empirical advancement across a wide variety of graph-related applications. Despite their great success, training GCNs on large graphs suffers from computational and memory issues. A potential path to circumvent these obstacles is sampling-based methods, where at each layer a subset of nodes is sampled. Although recent studies have empirically demonstrated the effectiveness of sampling-based methods, these works lack theoretical convergence guarantees under realistic settings and cannot fully leverage the information of evolving parameters during optimization. In this paper, we describe and analyze a general \textbf{\textit{doubly variance reduction}} schema that can accelerate any sampling method under a memory budget. The impetus for the proposed schema is a careful analysis of the variance of sampling methods, which shows that the induced variance can be decomposed into node embedding approximation variance (\emph{zeroth-order variance}) during forward propagation and layerwise-gradient variance (\emph{first-order variance}) during backward propagation. We theoretically analyze the convergence of the proposed schema and show that it enjoys an $\mathcal{O}(1/T)$ convergence rate. We complement our theoretical results by integrating the proposed schema in different sampling methods and applying them to different large real-world graphs. Code is publicly available at~\url{https://github.com/CongWeilin/SGCN.git}.
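The zeroth-order (embedding-approximation) variance reduction can be illustrated with a toy control-variate sketch; this is not the paper's schema, and the stale "historical" embeddings, sizes, and staleness level are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy layer: a node aggregates the mean of its N neighbor embeddings,
# but on a large graph we can only afford to sample k of them.
N, d, k = 200, 8, 10
h = rng.standard_normal((N, d))                   # current embeddings
h_hist = h + 0.1 * rng.standard_normal((N, d))    # stale historical embeddings
full = h.mean(axis=0)                             # exact (expensive) aggregation

def plain_estimate():
    # Unbiased but high-variance: average k sampled neighbors.
    idx = rng.choice(N, size=k, replace=False)
    return h[idx].mean(axis=0)

def variance_reduced_estimate():
    # Control variate: cheap full aggregation of historical embeddings,
    # plus a sampled correction of the small difference h - h_hist.
    idx = rng.choice(N, size=k, replace=False)
    return h_hist.mean(axis=0) + (h[idx] - h_hist[idx]).mean(axis=0)

def mse(estimator, reps=2000):
    return np.mean([np.sum((estimator() - full) ** 2) for _ in range(reps)])

print(mse(plain_estimate), mse(variance_reduced_estimate))
```

Both estimators are unbiased, but the corrected one only samples the residual between current and historical embeddings, so its variance scales with the (small) staleness rather than the full embedding magnitude.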
Information theoretic limits of learning a sparse rule
Luneau, Clément, Barbier, Jean, Macris, Nicolas
We consider generalized linear models in regimes where the number of nonzero components of the signal and accessible data points are sublinear with respect to the size of the signal. We prove a variational formula for the asymptotic mutual information per sample when the system size grows to infinity. This result allows us to derive an expression for the minimum mean-square error (MMSE) of the Bayesian estimator when the signal entries have a discrete distribution with finite support. We find that, for such signals and suitable vanishing scalings of the sparsity and sampling rate, the MMSE is a nonincreasing, piecewise-constant function of the sampling rate. In specific instances the MMSE even displays an all-or-nothing phase transition, that is, the MMSE sharply jumps from its maximum value to zero at a critical sampling rate. The all-or-nothing phenomenon has previously been shown to occur in high-dimensional linear regression. Our analysis goes beyond the linear case and applies to learning the weights of a perceptron with general activation function in a teacher-student scenario. In particular, we discuss an all-or-nothing phenomenon for the generalization error with a sublinear set of training examples.
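Schematically, the all-or-nothing transition described above takes the following form; the notation here ($\alpha$ for the sampling rate, $\alpha_c$ for its critical value, $X$ for a signal entry) is assumed for illustration, not taken from the paper:

```latex
\[
\lim_{n\to\infty} \mathrm{MMSE}_n(\alpha) =
\begin{cases}
\operatorname{Var}(X) & \text{if } \alpha < \alpha_c,
  \quad \text{(estimation no better than the prior)}\\[2pt]
0 & \text{if } \alpha > \alpha_c,
  \quad \text{(asymptotically perfect recovery)}
\end{cases}
\]
```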
Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm
Spigler, Stefano, Geiger, Mario, Wyart, Matthieu
How much training data is needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-\beta}$ where $n$ is the number of training examples and $\beta$ an exponent that depends on both data and algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets. For MNIST we find $\beta\approx 0.4$ and for CIFAR10 $\beta\approx 0.1$. Remarkably, $\beta$ is the same for regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we introduce the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption --- namely that the data are sampled from a regular lattice --- we derive analytically $\beta$ for translation invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the training data and their dimension. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, our results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data should be defined in terms of how the distance between nearest data points depends on $n$. With this definition one obtains reasonable effective smoothness estimates for MNIST and CIFAR10.
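The measurement of $\beta$ described above can be mimicked on synthetic data; in this minimal sketch the 1-D teacher function, Laplace kernel, and grid of sample sizes are illustrative assumptions, and the resulting exponent is specific to this toy setup, not a prediction of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def krr_error(n, n_test=500, sigma=1.0, reg=1e-6):
    # Student: kernel ridge regression with a Laplace kernel,
    # learning a smooth 1-D teacher function from n samples.
    x = rng.uniform(-1, 1, n)
    xt = rng.uniform(-1, 1, n_test)
    teacher = lambda z: np.sin(3 * z)
    K = np.exp(-np.abs(x[:, None] - x[None, :]) / sigma)
    Kt = np.exp(-np.abs(xt[:, None] - x[None, :]) / sigma)
    coef = np.linalg.solve(K + reg * np.eye(n), teacher(x))
    return np.mean((Kt @ coef - teacher(xt)) ** 2)

ns = np.array([20, 40, 80, 160, 320])
errs = np.array([np.mean([krr_error(n) for _ in range(5)]) for n in ns])

# Fit error ~ n^{-beta}: beta is minus the slope in log-log coordinates.
beta = -np.polyfit(np.log(ns), np.log(errs), 1)[0]
print(f"estimated beta: {beta:.2f}")
```

The same log-log slope estimate is what one would apply to learning curves measured on MNIST or CIFAR10; only the data source and kernel change.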